arXiv:1506.05865v4 [cs.CL] 19 Feb 2016 


LCSTS: A Large Scale Chinese Short Text Summarization Dataset 


Baotian Hu Qingcai Chen Fangze Zhu 

Intelligent Computing Research Center 
Harbin Institute of Technology, Shenzhen Graduate School 

{baotianchina, qingcai.chen, zhufangze123}@gmail.com


Abstract 


Automatic text summarization is widely regarded as a highly difficult
problem, partially because of the lack of large text summarization
datasets. Because constructing large-scale summaries for full texts is
very challenging, in this paper we introduce a large corpus for Chinese
short text summarization, constructed from the Chinese microblogging
website Sina Weibo and released to the public^1. The corpus consists of
over 2 million real Chinese short texts with short summaries given by
the author of each text. We also manually tagged the relevance of 10,666
short summaries to their corresponding short texts. Based on the corpus,
we introduce a recurrent neural network for summary generation and
achieve promising results, which not only show the usefulness of the
proposed corpus for short text summarization research but also provide a
baseline for further research on this topic.

1 Introduction 


Nowadays, individuals and organizations can easily share or post
information to the public on social networks. Take the popular Chinese
microblogging website Sina Weibo as an example: the People’s Daily, one
of the media outlets in China, posts tens of weibos (analogous to
tweets) each day. Most of these weibos are well-written and highly
informative because of the text length limitation (less than 140 Chinese
characters). Such data is regarded as naturally annotated web resources
(Sun, 2011). If we can mine high-quality data from these naturally
annotated web resources, it will benefit research that has been hampered
by the lack of data.

^1 http://icrc.hitsz.edu.cn/Article/show/139.html


Figure 1: A Weibo Posted by People’s Daily. 


In the Natural Language Processing (NLP) community, automatic text
summarization is a hot and difficult task. A good summarization system
should understand the whole text and re-organize the information to
generate coherent, informative, and significantly shorter summaries that
convey the important information of the original text (Hovy and Lin,
1998; Martins, 2007). Most traditional abstractive summarization methods
divide the process into two phases (Bing et al., 2015). First, key
textual elements are extracted from the original text by using
unsupervised methods or linguistic knowledge. Then, unclear extracted
components are rewritten or paraphrased to produce a concise summary of
the original text by using linguistic rules or language generation
techniques. Although extensive research has been done, the linguistic
quality of abstractive summaries is still far from satisfactory.
Recently, deep learning methods have shown potential abilities to learn
representations (Hu et al., 2014; Zhou et al., 2015) and generate
language (Bahdanau et al., 2014; Sutskever et al., 2014) from
large-scale data by utilizing GPUs. Many researchers realize that deep
learning methods bring us closer to generating abstractive summaries.
However, publicly available, high-quality, large-scale summarization
datasets are still very rare and not easy to construct manually. For
example, the popular document summarization datasets DUC^2, TAC^3 and
TREC^4 have only hundreds of human-written English text summaries. The
problem is even worse for Chinese. In this paper, we take one step back
and focus on constructing LCSTS, the Large-scale Chinese Short Text
Summarization dataset, by utilizing the naturally annotated web
resources on Sina Weibo. Figure 1 shows one weibo posted by the People’s
Daily. In order to convey the important information to the public
quickly, it also writes a very informative and short summary (in the
blue circle) of the news. Our goal is to mine a large-scale,
high-quality short text summarization dataset from these texts.

^2 http://duc.nist.gov/data.html
^3 http://www.nist.gov/tac/2015/KBP/
^4 http://trec.nist.gov/

Figure 2: Diagram of the process for creating the dataset.

This paper makes the following contributions: (1) We introduce a
large-scale Chinese short text summarization dataset. To our knowledge,
it is the largest one to date. (2) We provide standard splits of the
dataset into a large-scale training set and a human-labeled test set,
which makes it easier to benchmark related methods. (3) We explore the
properties of the dataset and sample 10,666 instances for manually
checking and scoring the quality of the dataset. (4) We apply a
recurrent neural network based encoder-decoder method to the dataset to
generate summaries and get promising results, which can be used as one
baseline for the task.


2 Related Work 


Our work is related to recent work on automatic text summarization and
on natural language processing based on naturally annotated web
resources, which are briefly introduced as follows.

Automatic Text Summarization in some form has been studied since 1950.
Since then, most research has been on extractive summarization, which
analyzes the organization of the words in the document (Nenkova and
McKeown, 2011; Luhn, 1998). Since supervised machine learning methods
need labeled datasets and labeling is very labor-intensive, some
research has focused on unsupervised methods (Mihalcea, 2004). Existing
datasets are usually very small (most of them contain fewer than 1,000
documents). For example, the DUC2002 dataset contains 567 documents,
each provided with two 100-word human summaries. Our work is also
related to headline generation, the task of generating a one-sentence
title for a text. Colmenares et al. construct a dataset of 1.3 million
English financial news headlines for headline generation (Colmenares et
al., 2015). However, the dataset is not publicly available.

Naturally Annotated Web Resources based Natural Language Processing is
proposed by Sun (Sun, 2011). Naturally annotated web resources are data
generated by users for communicative purposes, such as web pages, blogs
and microblogs. We can mine knowledge or useful data from these raw data
by using marks generated unintentionally by users. Leskovec et al. track
1.6 million mainstream media sites and blogs and mine a set of novel and
persistent temporal patterns in the news cycle (Leskovec et al., 2009).
Sepandar et al. use the users’ naturally annotated patterns ‘we feel’
and ‘i feel’ to extract the ‘Feeling’ sentence collection, which is used
to collect the world’s emotions. In this work, we use naturally
annotated resources to construct large-scale Chinese short text
summarization data to facilitate research on text summarization.


3 Data Collection 

Many popular Chinese media and organizations have created accounts on
Sina Weibo, which they use to post news and information. These accounts
are verified on Weibo and labeled with a blue ‘V’. In order to guarantee
the quality of the crawled text, we only crawl the verified
organizations’ weibos, which are more likely to be clean, formal and
informative. A lot of human intervention is required in each step. The
process of the data collection is shown in Figure 2 and summarized as
follows:

1) We first collect 50 very popular organization users as seeds. They
come from domains such as politics, economics, military, movies and
games, for example the People’s Daily, the Economic Observer press and
the Ministry of National Defense.

2) We then crawl the users followed by these seed users and filter them
by using human-written rules, for example that the user must be blue
verified and have more than 1 million followers.

3) We use the chosen users and a text crawler to crawl their weibos.

4) We filter, clean and extract (short text, summary) pairs. About 100
rules are used to extract high-quality pairs. These rules were drawn up
by 5 people through careful investigation of the raw text. We also
remove pairs whose short text is too short (less than 80 characters) or
whose summary length is outside [10, 30].
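The two filters stated explicitly above (the user rule in step 2 and the
length rule in step 4) can be sketched as follows. This is only a minimal
illustration: the `User` record, its field names and the function names
are hypothetical, lengths are counted in characters, and the roughly 100
hand-written extraction rules are not reproduced because the paper does
not list them.

```python
from dataclasses import dataclass

MIN_TEXT_CHARS = 80        # short texts under 80 characters are removed
SUMMARY_RANGE = (10, 30)   # summary length must lie inside [10, 30]
MIN_FOLLOWERS = 1_000_000  # users need more than 1 million followers


@dataclass
class User:
    """Hypothetical user record; field names are assumptions."""
    name: str
    verified: bool   # stands for the blue 'V' verification
    followers: int


def keep_user(user: User) -> bool:
    """Step 2: keep blue-verified users with more than 1 million followers."""
    return user.verified and user.followers > MIN_FOLLOWERS


def keep_pair(text: str, summary: str) -> bool:
    """Step 4: keep pairs whose lengths (in characters) are in range."""
    low, high = SUMMARY_RANGE
    return len(text) >= MIN_TEXT_CHARS and low <= len(summary) <= high
```

Python's `len` counts Unicode code points, so each Chinese character
counts as one unit, which matches the character-based limits in the text.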

4 Data Properties 

The dataset consists of three parts, shown in Table 1, and the length
distributions of the texts are shown in Figure 3.

Part I is the main content of LCSTS and contains 2,400,591 (short text,
summary) pairs. These pairs can be used to train supervised learning
models for summary generation.

Part II contains 10,666 human-labeled (short text, summary) pairs with
scores ranging from 1 to 5 that indicate the relevance between the short
text and the corresponding summary. ‘1’ denotes “the least relevant” and
‘5’ denotes “the most relevant”. To annotate this part, we recruited 5
volunteers; each pair is labeled by only one annotator. These pairs are
randomly sampled from Part I and are used to analyze the distribution of
pairs in Part I. Figure 4 illustrates examples of different scores. From
the examples we can see that in pairs scored 3, 4 or 5 the summaries are
very relevant to the corresponding short texts. These summaries are
highly informative, concise and significantly shorter than the original
text. We can also see that many words in the summary do not appear in
the original text, which indicates the significant difference between
our dataset and sentence compression datasets. The summaries of pairs
scored 1 or 2 are highly abstractive and relatively hard to derive from
the short text. They are more likely to be headlines or comments than
summaries. The statistics show that pairs scored 1 or 2 make up less
than 20% of the data and can be filtered out by using a trained
classifier.

Figure 3: Box plot of lengths for short text (ST), segmented short text
(Segmented ST), summary (SUM) and segmented summary (Segmented SUM). The
red line denotes the median, and the edges of the box the quartiles.

Part III contains 1,106 pairs. For this part, 3 annotators label the
same 2,000 texts and we extract the texts with common scores. This part
is independent of Part I and Part II. In this work, we use the pairs of
this part scored 3, 4 or 5 as the test set for the short text summary
generation task.
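The construction of Part III can be sketched as below. This is a minimal
illustration that assumes "common scores" means all annotators gave the
same score to a pair; the function names and the input layout (a mapping
from a pair ID to the list of scores) are hypothetical.

```python
def agreed_pairs(labels):
    """Keep pairs on which every annotator gave the same score.

    labels maps a pair ID to the list of scores from the annotators.
    Returns a dict from pair ID to the agreed score.
    """
    agreed = {}
    for pair_id, scores in labels.items():
        if scores and len(set(scores)) == 1:
            agreed[pair_id] = scores[0]
    return agreed


def select_test_set(agreed, min_score=3):
    """Keep the agreed pairs scored 3, 4 or 5, as used for the test set."""
    return sorted(pid for pid, score in agreed.items() if score >= min_score)


# Toy labels from three annotators (made-up values for illustration).
toy_labels = {
    "p1": [5, 5, 5],   # agreement, high score: goes into the test set
    "p2": [2, 2, 2],   # agreement, low score: kept in Part III only
    "p3": [4, 3, 4],   # disagreement: dropped
}
toy_agreed = agreed_pairs(toy_labels)
toy_test = select_test_set(toy_agreed)
```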


Part I     Number of Pairs    2,400,591

Part II    Number of Pairs       10,666
           Human Score 1            942
           Human Score 2          1,039
           Human Score 3          2,019
           Human Score 4          3,128
           Human Score 5          3,538

Part III   Number of Pairs        1,106
           Human Score 1            165
           Human Score 2            216
           Human Score 3            227
           Human Score 4            301
           Human Score 5            197


Table 1: Data Statistics 
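The counts in Table 1 can be checked with a few lines of arithmetic. The
sketch below copies the per-score counts from the table and computes the
share of pairs scored 1 or 2 in Part II and the number of Part III pairs
scored 3 or higher; the variable and function names are illustrative.

```python
# Per-score pair counts copied from Table 1.
PART_II = {1: 942, 2: 1039, 3: 2019, 4: 3128, 5: 3538}
PART_III = {1: 165, 2: 216, 3: 227, 4: 301, 5: 197}


def low_score_share(counts):
    """Fraction of pairs scored 1 or 2."""
    return (counts[1] + counts[2]) / sum(counts.values())


def usable_pairs(counts, min_score=3):
    """Number of pairs scored min_score or higher."""
    return sum(n for score, n in counts.items() if score >= min_score)


part_ii_total = sum(PART_II.values())       # 10,666
part_iii_total = sum(PART_III.values())     # 1,106
part_ii_low = low_score_share(PART_II)      # about 0.186, below 20%
test_set_size = usable_pairs(PART_III)      # 227 + 301 + 197 = 725
```

The per-score counts add up to the totals reported in the table, and the
Part II share of low-scoring pairs is consistent with the "less than 20%"
statement in the text.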


5 Experiment 

Recently, recurrent neural networks (RNNs) have shown powerful abilities
in speech recognition (Graves et al., 2013), machine translation
(Sutskever et al., 2014) and automatic dialog response (Shang et al.,
2015). However, there is little research on automatic text summarization
using deep models. In this section, we use RNNs as encoder and decoder
to generate the summary of a short text. We use Part I as the training
set and the subset of Part III scored 3, 4 or 5 as the test set.

Short Text: Mingzhong Chen, the Chief Secretary of the Water Division of
the Ministry of Water Resources, revealed today at a press conference
that, according to the just-completed assessment of the water resources
management system, some provinces are close to the red line indicator
and some provinces are over the red line indicator. In some places over
the red line, it will enforce regional approval restrictions on some
water projects and strictly implement water resources assessment and the
approval of water licensing.
Summarization: Some provinces exceed the red line indicator of annual
water use; approval of some water projects will be limited.
Human Score: 5

Short Text: Groupons' sales on mobile terminals are below 30 percent.
Users' preference for shopping through PCs cannot be changed in the
short term. In the future Chinese O2O catering market, mobile terminals
will become the strategic development direction. It will also shift from
being driven online to being driven offline. The first and second tier
cities are facing growth difficulties. However, the O2O market in the
third and fourth tier cities contains opportunities.
Summarization: The mobile terminals will become catering's strategic
development direction.
Human Score: 4

Short Text: In July, the average newly-built house price of 100 cities
is 10347 yuan per square meter, which rose 0.87%. It has risen for the
14th consecutive month. Among them, Guangzhou, Beijing, Shenzhen and
Nanjing rose more than 10%. Dawei Zhang, from Centaline Property Agency,
said that because the first and second-tier cities gather too many
resources, house prices are likely to rise and hard to fall.
Summarization: 100-cities' house prices gain "14th consecutive rising",
the first and second-tier cities rise more.
Human Score: 3

Short Text: Reporters combed the information and found that, from 2009
to now, there have been at least 8 lottery delay events, each with a
delay of more than 2 hours. On May 6, 2014, draw No. 2014050 was delayed
more than 4 hours. The center of welfare lottery only responded to 3 of
the 8 events. Their explanations are that a communications breakdown and
heavy rain led to a delayed data upload. There are no explanations for
the other 5 delay events.
Summarization: Ask about the lottery delay for the third time: why
should the lottery wait for data collection?
Human Score: 2

Short Text: According to China's Ministry of Commerce, China's actually
utilized foreign capital in July fell sharply, by about 16.95% to 7.81
billion dollars, compared to last year. Outside analysts believe that it
is related to the recent intensive official antitrust investigations.
Danyang Shen responded, "It can not be linked to the antitrust
investigation of foreign investment, or do other unfounded association".
Summarization: China's Ministry of Commerce responds to antitrust
investigation: several cases will not scare foreign investors away.
Human Score: 1

Figure 4: Five examples of different scores.

Two approaches are used to preprocess the data: 1) a character-based
method, in which we take Chinese characters as input, which reduces the
vocabulary size to 4,000; 2) a word-based method, in which the text is
segmented into Chinese words by using jieba^5 and the vocabulary is
limited to 50,000. We adopt two deep architectures: 1) The local context
is not used during decoding. We use the RNN as encoder and its last
hidden state as the input of the decoder, as shown in Figure 5. 2) The
context is used during decoding. Following (Bahdanau et al., 2014), we
use the combination of all the hidden states of the encoder as input of
the decoder, as shown in Figure 6. For the RNN, we adopt the gated
recurrent unit (GRU), which is proposed by (Chung et al., 2015) and has
been shown to be comparable to LSTM (Chung et al., 2014). All the
parameters (including the embeddings) of the two architectures are
randomly initialized and ADADELTA (Zeiler, 2012) is used to update the
learning rate. After the model is trained, beam search is used to
generate the best summaries during decoding; the beam size is set to 10
in our experiment.

Figure 5: The graphical depiction of the RNN encoder and decoder
framework without context.

Figure 6: The graphical depiction of the RNN encoder and decoder
framework with context.
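The decoding step can be illustrated with a small, generic beam search.
This is only a sketch under stated assumptions, not the decoder used in
the experiments: `step_fn` is a hypothetical stand-in for the trained
decoder that returns next-token probabilities for a prefix, and the toy
probability table exists only to make the example runnable.

```python
import math


def beam_search(step_fn, start, end, beam_size=10, max_len=30):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict mapping each possible next token
    to its probability given the prefix (a list of tokens).
    """
    beams = [(0.0, [start])]   # (log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((logp + math.log(prob), seq + [token]))
        if not candidates:
            break
        # Keep only the beam_size best expansions at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    # Hypotheses cut off by max_len compete with the finished ones.
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])[1]


# Toy next-token table (made-up probabilities for illustration).
TOY_TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.55, "b": 0.45},
    "b": {"</s>": 1.0},
}


def toy_step(prefix):
    return TOY_TABLE[prefix[-1]]


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
wide = beam_search(toy_step, "<s>", "</s>", beam_size=10)
```

With a beam of 1 the search commits to the locally best first token and
ends with a sequence of probability 0.33, while the wider beam finds the
sequence of probability 0.4, which is the reason for decoding with a beam
larger than 1.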


^5 https://pypi.python.org/pypi/jieba/

model         data   R-1     R-2     R-L
RNN           Word   0.177   0.085   0.158
              Char   0.215   0.089   0.186
RNN context   Word   0.268   0.161   0.241
              Char   0.299   0.174   0.272


Table 2: The experiment result: “Word” and 
“Char” denote the word-based and character- 
based input respectively. 

Short Text: On September 3, the Geneva-based World Economic Forum
released “The Global Competitiveness Report 2014-2015”. Switzerland
topped the list for six consecutive years, becoming the world’s most
competitive country. Singapore and the United States are in the second
and third place respectively. China is in the 28th place, ranking
highest among the BRIC countries.

H: The Global competitiveness ranking list, China is in the 28th
place, the highest among BRIC countries.

RNN+Char: “The Global Competitiveness Report”: Switzerland ranks 6th in
26th ranks 3rd ranks 26th 6th ranks 26th place 6th place ranks 26th
place 6th place ranks 26th place

RNN+Word: 2014 Global competitiveness ranking; Switzerland ranks first
place, Beijing ranks ranks ranks ranks list ranks ranks ranks list ranks
68th world ranks ranks ranks list ranks 68th.

RNN+Context+Char: The Global competitiveness ranking list: China is in
the 28th place which is highest among the BRIC countries.

RNN+Context+Word: “The Global Competitiveness Report 2014-2015”:
Switzerland topped the list for six consecutive years, China is in the
28th place (can not be translated) China ranks 28th.


Figure 7: An example of the generated summaries. 


For evaluation, we adopt the ROUGE metrics proposed by (Lin and Hovy,
2003), which have been shown to correlate strongly with human
evaluations. ROUGE-1, ROUGE-2 and ROUGE-L are used. Because the standard
ROUGE package^6 is designed for evaluating English summarization
systems, we transform the Chinese words to numerical IDs to adapt to it.
All the models are trained on Tesla M2090 GPUs for about one week.
Table 2 lists the experiment results. As we can see in Figure 7, the
summaries generated by the RNN with context are very close to
human-written summaries, which indicates that if we feed enough data to
the RNN encoder and decoder, it may generate summaries almost from
scratch.
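The ID mapping and the three metrics can be sketched in plain Python.
This is a simplified illustration, not the ROUGE package itself: it
computes recall only (the paper does not state which ROUGE variant is
reported), and the function names and the toy sentences are made up for
the example.

```python
from collections import Counter


def to_ids(tokens, vocab):
    """Map tokens to integer IDs, extending vocab with unseen tokens."""
    return [vocab.setdefault(token, len(vocab)) for token in tokens]


def ngram_counts(seq, n):
    """Multiset of the n-grams in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Clipped n-gram overlap divided by the reference n-gram count."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


# Toy character-level example (made-up sentences).
vocab = {}
ref_ids = to_ids(list("中国排名第二"), vocab)
cand_ids = to_ids(list("中国排名第一"), vocab)
r1 = rouge_n_recall(cand_ids, ref_ids, 1)   # 5 of 6 unigrams match
r2 = rouge_n_recall(cand_ids, ref_ids, 2)   # 4 of 5 bigrams match
rl = rouge_l_recall(cand_ids, ref_ids)      # LCS of length 5 out of 6
```

Because the metrics only compare tokens for equality, replacing each
Chinese token with an integer ID leaves the scores unchanged, which is
why the ID transformation makes an English-oriented tool usable.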

The results also show that the RNN with context outperforms the RNN
without context on both character-based and word-based input. This
result indicates that the internal hidden states of the RNN encoder can
be combined to represent the context of words in the summary. The
character-based input also outperforms the word-based input. As shown in
Figure 8, the summary generated by the RNN with context from the
character-based short text is relatively good, while the summary
generated by the RNN with context on word-based input contains many
UNKs. This may be because segmentation leads to many out-of-vocabulary
words, such as person names and organization names, which become UNKs in
the text. For example, the company name (Yuanjing Photoelectron) is not
in the vocabulary of the word-based RNN, so the RNN summarizer has to
use UNKs in its place during decoding.

^6 http://www.berouge.com/Pages/default.aspx

Short Text: The door of the factory is locked. About 20 workers are
scattered, sitting in the shade. “We are ordinary workers, we are
waiting for our salary here,” one of them said. On the morning of July
4th, reporters arrived at Shenzhen Yuanjing Photoelectron Corporation,
located on Qinghu Road, Longhua District, Shenzhen. Just as rumored,
Yuanjing Photoelectron Corporation is closed down and the large
shareholder Xing Yi is missing.

H: Hundred-million-yuan class LED enterprise is closed down and
workers wait for the boss under the scorching sun.

RNN+Context+Char: Shenzhen Yuanjing Photoelectron Corporation is closed
down (picture) (set of pictures)

RNN+Context+Word: Shenzhen "UNK": Shenzhen UNK, UNK, UNK, UNK, UNK

Figure 8: An example of the generated summaries with UNKs.
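The effect of a capped vocabulary can be reproduced on a toy corpus. The
sketch below is an illustration only: the three tiny "texts", the
made-up company name and the vocabulary sizes are assumptions, and the
texts are already split into words (in practice a segmenter such as
jieba would produce them).

```python
from collections import Counter

UNK = "UNK"


def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens of the corpus."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, _ in counts.most_common(max_size)}


def apply_vocab(tokens, vocab):
    """Replace every out-of-vocabulary token with UNK."""
    return [token if token in vocab else UNK for token in tokens]


# Toy word-segmented corpus; the last text contains a made-up rare name.
word_corpus = [
    ["深圳", "公司", "倒闭"],
    ["深圳", "公司", "上市"],
    ["深圳", "甲乙光电", "倒闭"],
]
# The same texts viewed as character sequences.
char_corpus = [list("".join(words)) for words in word_corpus]

word_vocab = build_vocab(word_corpus, max_size=3)
char_vocab = build_vocab(char_corpus, max_size=50)

word_view = apply_vocab(word_corpus[2], word_vocab)
char_view = apply_vocab(char_corpus[2], char_vocab)
```

On this toy corpus the rare name falls outside the capped word vocabulary
and turns into UNK, while every character of the same text stays in the
character vocabulary, which mirrors the behaviour described above.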


6 Conclusion and Future Work 


We constructed a large-scale Chinese short text summarization dataset
and applied RNN-based methods to it, which achieved some promising
results. This is just a start for deep models on this task and there is
much room for improvement. We take the whole short text as one sequence,
which may not be very reasonable because most short texts contain
several sentences. A hierarchical RNN (Li et al., 2015) is one possible
direction. The rare word problem is also very important for the
generation of summaries, especially when the input is word-based instead
of character-based. It is also a hot topic in neural generative models
such as neural machine translation (NMT) (Luong et al., 2014), whose
advances can benefit this task. We also plan to construct a large
document summarization dataset by using naturally annotated web
resources.